Rework on ParquetDataset for easy access and better cache size in eager mode #384

Merged
merged 4 commits into tensorflow:master from yongtang:parquet
Aug 5, 2019

Conversation

@yongtang (Member) commented Jul 27, 2019

This fix is part of the effort to improve the overall Dataset for easier access and better cache size in eager mode. See #382 and #366 for related discussions.

In order to be able to read a file either from a filename or from memory, this PR adds a SizedRandomAccessFile which allows providing an optional memory buffer as the file content. This could be useful when processing compressed or archived files, where we could simply read the uncompressed file content into memory.

The previous limitation in Dataset was that a Dataset was an iterable, so the sequence length was unknown until graph runtime. In this PR, we provide a helper function to read the columns of a parquet file so that the length is known.

This also could open other avenues, such as mapping a parquet file with __getitem__ and __len__.
Further, a parquet file could be read into a Tensor and processed easily (e.g., with a pandas-like API).
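
As a rough illustration of that avenue, below is a minimal sketch of an indexable wrapper built on the ops described in this PR; the module path `tensorflow_io.parquet`, the call `read_parquet(filename, column)`, and the example file/column names are assumptions for illustration, not the final API.

```python
# Hypothetical sketch: an indexable, pandas-like view over one parquet column.
# The module path and the read_parquet signature are assumptions based on
# this PR's description (eager mode only).
import tensorflow_io.parquet as parquet  # assumed module path


class ParquetColumn:
  """Expose a parquet column through __getitem__ and __len__."""

  def __init__(self, filename, column):
    # Assumed: read_parquet(filename, column) returns the whole column
    # as a Tensor in eager mode.
    self._values = parquet.read_parquet(filename, column)

  def __len__(self):
    # The first dimension is the number of records in the column.
    return int(self._values.shape[0])

  def __getitem__(self, index):
    return self._values[index]


# Usage (eager mode), with a hypothetical file and column name:
# col = ParquetColumn("data.parquet", "price")
# print(len(col), col[0])
```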

The list_parquet_columns approach could be similarly applied to HDF5, where it is even more important: an HDF5 file could have datasets with different sizes.

Summary:

  1. Two basic C++ kernel ops are implemented: list_parquet_columns and read_parquet
  2. One ParquetDataset that is a Python-only implementation (no C++ anymore)
  3. ParquetDataset supports eager and graph mode. In graph mode, dtype and shape
     are provided by the user explicitly; in eager mode, only the column name is needed.
  4. read_parquet works in eager and graph mode, and can read records either in full or in slices
  5. list_parquet_columns works in eager mode only (a limitation). See the usage sketch after this list.
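
A rough usage sketch of the pieces summarized above; the module path, keyword names, and the example file/column names are assumptions for illustration, not the final API.

```python
# Rough usage sketch of the ops summarized above; module path and
# argument names are assumptions, not the final API.
import tensorflow as tf  # used only in the graph-mode example below
import tensorflow_io.parquet as parquet  # assumed module path

filename = "data.parquet"  # hypothetical file

# Eager mode: list_parquet_columns discovers column names, dtypes and shapes.
columns = parquet.list_parquet_columns(filename)
print(columns)

# Eager mode: only the column name is needed to build the dataset.
dataset = parquet.ParquetDataset(filename, ["price"])

# Graph mode: dtype and shape must be supplied by the user explicitly,
# since list_parquet_columns is eager-only and cannot discover them here.
# dataset = parquet.ParquetDataset(filename, ["price"],
#                                  dtype=[tf.float64], shape=[[1000]])

# read_parquet works in both modes and can read a column in full.
values = parquet.read_parquet(filename, "price")
```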

For the cache batch size vs. the batch size in tf.keras:

  1. Added a hidden `capacity` argument to adjust the cache batch size
  2. The batch size passed to tf.keras is unrelated to `capacity`, but we could use `rebatch`
     to change it at the end of the pipeline (see the sketch after this list).
  3. `capacity` could be padded to allow `rebatch` to only cut a slice within one chunk.
     If not padded to the `batch_size` used in tf.keras, then `rebatch` will likely copy across chunk boundaries.
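
To illustrate items 2 and 3: a rebatch at the end of the pipeline can be approximated with unbatch followed by batch. In the sketch below, the `capacity` keyword, the module path, and the file/column names are assumptions for illustration; it also assumes ParquetDataset behaves like a tf.data.Dataset that emits chunks of `capacity` records.

```python
# Sketch of decoupling the cache chunk size (`capacity`) from the batch
# size handed to tf.keras; the `capacity` keyword and module path are
# assumptions for illustration.
import tensorflow_io.parquet as parquet  # assumed module path

# Read in large chunks to keep the eager-mode cache efficient.
dataset = parquet.ParquetDataset("data.parquet", ["price"], capacity=4096)

# Rebatch at the end of the pipeline: unbatch() then batch() yields the
# batch size tf.keras expects, independent of `capacity`. If 4096 is a
# multiple of 32, each output batch is a slice of a single cached chunk
# and no copy across chunk boundaries is needed.
dataset = dataset.unbatch().batch(32)

# model.fit(dataset, ...)  # hypothetical tf.keras model
```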

Signed-off-by: Yong Tang yong.tang.github@outlook.com

@yongtang (Member, Author)

/cc @terrytangyuan @BryanCutler @feihugis

/cc @CaptainDuke in case you are interested. I am thinking about applying a similar enhancement to HDF5 as well.

@CaptainDuke (Contributor)

Many thanks to Yongtang.

Yes, actually the contents of HDF5 files do not need to be decoded. Also, I'm working on HDF5 files with different sizes. For example:

# h5ls test_data_level_6/10.hdf5 
atk_diff            Dataset {5120, 1}
emy_vec_5           Dataset {5120, 429}
frame                    Dataset {5120, 1}
global_info              Dataset {5120, 68}
hot_label                Dataset {5120, 1}
hot_weight               Dataset {5120, 1}
img_data                 Dataset {5120, 5, 31, 31}
...

I believe such an enhancement would be helpful.
BTW, is the bug in issue #342 related to this problem?

> The previous limitation in Dataset was that a Dataset was an iterable, so the sequence length was unknown until graph runtime. In this PR, we provide a helper function to read the specs of a parquet file so that the length is known.

@yongtang
Copy link
Member Author

@CaptainDuke the issue #342 you are referring to might not be directly related to this problem. However, the recent changes in upstream tf.data (tensorflow/tensorflow@c5c1839) might make things complicated, as we will likely need to update the API pretty soon. With the ongoing rework of the cache size and the tf.io pipeline interaction with tf.data, it might make sense to fix that together with this PR.

yongtang merged commit 1642da1 into tensorflow:master on Aug 5, 2019
yongtang deleted the parquet branch on August 5, 2019 21:04
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
Rework on ParquetDataset for easy access and better cache size in eager mode (tensorflow#384)

* Rework on ParquetDataset for easy access and better cache size in eager mode

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Fix build failures

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Rename read_parquet_columns => list_parquet_columns

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Remove batch args, and add test in graph mode

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>